Abstract

We consider inapproximability of the correlation clustering problem defined as follows: Given
a graph G = (V,E) where each edge is labeled either "+" (similar) or "-" (dissimilar),
correlation clustering seeks to partition the vertices into clusters so that the number of pairs
correctly (resp. incorrectly) classified with respect to the labels is maximized (resp. minimized).
The two complementary problems are called MaxAgree and MinDisagree, respectively, and have
been studied on complete graphs, where every edge is labeled, and general graphs, where some
edges might not have been labeled. Natural edge-weighted versions of both problems have been
studied as well. Let S-MaxAgree denote the weighted problem where all weights are taken
from set S. We show that S-MaxAgree with weights bounded by $O(|V|^{1/2-\delta})$ essentially
belongs to the same hardness class in the following sense: if there is a polynomial time algorithm
that approximates S-MaxAgree within a factor of $\lambda = O(\log |V|)$ with high probability, then
for any choice of S', S'-MaxAgree can be approximated in polynomial time within a factor of
$(\lambda + \epsilon)$, where $\epsilon > 0$ can be arbitrarily small, with high probability. A similar
statement also holds for S-MinDisagree. This result implies it is hard (assuming $NP \neq RP$) to
approximate unweighted MaxAgree within a factor of $80/79 - \epsilon$, improving upon a previous
known factor of $116/115 - \epsilon$ by Charikar et al. [4].

1 Introduction

Motivated by applications of document clustering, Bansal, Blum and Chawla [2] introduced the
correlation clustering problem where, for a corpus of documents, we represent each document by
a node, and an edge (u,v) is labeled "+" or "-" depending on whether the two documents are
similar or dissimilar, respectively. The goal of correlation clustering is thus to find a partition of
the nodes into clusters that agrees as much as possible with the edge labels. Specifically, there
are two complementary problems. MaxAgree aims to maximize the number of agreements:
the number of + edges inside clusters plus the number of - edges across clusters; on the other
hand, MinDisagree aims to minimize the number of disagreements: the number of + edges
across different clusters plus the number of - edges inside clusters. Correlation clustering is also
viewed as a kind of agnostic learning problem [9] and seems to have been first studied by
Ben-Dor et al. [3] with applications in computational biology; Shamir et al. [10] were the first to
formalize it as a graph-theoretic problem, which they called Cluster Editing. Since Bansal et al.'s
independent introduction of this problem [2], it has been studied quite extensively in recent years.

Throughout the paper, when we talk about approximation factors we adopt the convention of
assuming the factor is greater than 1 for both maximization and minimization problems.



MaxAgree and MinDisagree have been studied on complete graphs, where every edge is
labeled, and general graphs, where some edges might not have been labeled. The latter captures
the case where a judge responsible for producing the labels is unable to tell if certain pairs are
similar or not. Also, it is natural for the judge to give some 'confidence level' for the labels
he produces; this leads to the natural edge-weighted versions, which we call S-MaxAgree and
S-MinDisagree respectively, indicating the edge weights are taken from set S.

The various versions of correlation clustering are fairly well studied. For the complete unweighted
case, Bansal et al. [2] gave a PTAS for MaxAgree and Charikar et al. [4] gave a 4-approximation
for MinDisagree and showed APX-hardness. For general weighted graphs, an $O(\log n)$-approximation
algorithm was also given in [4] for MinDisagree, and algorithms with the same approximation
factor were also obtained independently by Demaine and Immorlica [5], and Emanuel and Fiat
[?]. A $\frac{1}{0.7664}$-approximation algorithm was given for MaxAgree in [1], and this was
improved by Swamy [11] with a $\frac{1}{0.7666}$-approximation algorithm.

In this paper, we focus on the general graph case. Our main contribution is to show that
S-MaxAgree (resp. S-MinDisagree) with absolute values of weights bounded by $O(|V|^{1/2-\delta})$
belongs to the same hardness class in the following sense: if there is a polynomial time algorithm
that approximates S-MaxAgree (resp. S-MinDisagree) within a factor of $\lambda = O(\log |V|)$ with
high probability, then for any choice of S', S'-MaxAgree (resp. S'-MinDisagree) can be
approximated in polynomial time within a factor of $(\lambda + \epsilon)$, for any constant
$\epsilon > 0$, with high probability. This result implies it is hard (assuming $NP \neq RP$) to
approximate unweighted MaxAgree within a factor of $80/79 - \epsilon$, improving upon a previous
known factor of $116/115 - \epsilon$ by Charikar, Guruswami and Wirth [4].

Theorem 1 ([4]) For every $\epsilon > 0$, it is NP-hard to approximate the weighted version of
MaxAgree within a factor of $80/79 - \epsilon$. Furthermore, it is NP-hard to approximate the
unweighted version of MaxAgree within a factor of $116/115 - \epsilon$.



2 Definitions and Notations 

We give definitions and notations in this section. 



Definition 1 (S-MaxAgree) A MaxAgree problem is called S-MaxAgree if all edge weights
are taken from set S. An element in S can be either a constant or some function in the size of
the input graph.

S-MinDisagree is defined likewise. We assume 0 is always an element in S as we are interested
in the problem on general graphs in this paper. Assigning weight 0 to non-edges allows us to view
any general graph as a complete one.

Definition 2 (N-fold Roll) Given a graph G = (V,E) where $V = \{v_1, v_2, \ldots, v_n\}$, and
letting $(N-1)$ be a multiple of $(n-1)$, an N-fold roll (denoted by $G^N$) of G is created by
embedding multiple copies of G into an N by n grid where there are N parallel copies of V and a
node $v_{ij}$ corresponds to $v_j$ in the ith copy of V.

Edges of $G^N$ are created as follows. For any pair of nodes $(v_{i_1 j_1}, v_{i_2 j_2})$ where
$i_1, i_2 \in \{1, 2, \ldots, N\}$ and $j_1, j_2 \in \{1, 2, \ldots, n\}$, define the 'wrapped-around'
vertical distance of the two nodes as

$$
d(v_{i_1 j_1}, v_{i_2 j_2}) =
\begin{cases}
(i_2 - i_1) \bmod N & (j_1 < j_2) \\
\infty & (\text{otherwise})
\end{cases}
$$

A pair $(v_{i_1 j_1}, v_{i_2 j_2})$ is called a grid-bone if and only if

1) $j_1 \neq j_2$; and

2) $\frac{d(v_{i_1 j_1}, v_{i_2 j_2})}{j_2 - j_1} \in \left\{0, 1, \ldots, \frac{N-1}{n-1}\right\}$.

A grid-bone $(v_{i_1 j_1}, v_{i_2 j_2})$ is an edge identical to $(v_{j_1}, v_{j_2})$ (resp. a non-edge),
depending on whether $(v_{j_1}, v_{j_2})$ is an edge (resp. non-edge) in G. All non-grid-bone pairs
$(v_{i_1 j_1}, v_{i_2 j_2})$ are non-edges.

Note by construction $G^N$ consists of exactly $N\left(\frac{N-1}{n-1} + 1\right) > \frac{N^2}{n}$
duplicates of G. It is conceptually easier to see this by indexing each duplicate with a pair
(i, c), where $i \in \{1, 2, \ldots, N\}$ indexes the N parallel copies of V and
$c \in \{0, 1, \ldots, \frac{N-1}{n-1}\}$ can be thought of as the 'slope' of the grid-bones in
this copy. More precisely, duplicate (i, c) consists of nodes

$$
\{ v_{(i \bmod N),1},\ v_{(i+c \bmod N),2},\ v_{(i+2c \bmod N),3},\ \ldots,\ v_{(i+(n-1)c \bmod N),n} \}
$$

For our purpose, which will become evident in the rest of the paper, and for the sake of simpler
analysis, we assume w.l.o.g. that there are exactly $\frac{N^2}{n}$ duplicates of G. Note this can be
thought of as erasing all edges on (any) excessive $N\left(\frac{N-1}{n-1} + 1\right) - \frac{N^2}{n}$
duplicates.

In this construction, we obtain $\frac{N^2}{n}$ disjoint duplicates of E from just N disjoint
duplicates of V; this asymptotic gap is crucial in our proof of the main technical results (i.e.
Lemmas 3 and 4). We will discuss why we need this gap in the proof of Lemma 3.
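The indexing of duplicates by (i, c) can be checked on a toy grid. The following is a minimal sketch, assuming 0-based row and column indices (a convenience of this sketch, not the paper's notation) and treating every node pair inside a duplicate as one of its grid-bones; the names `roll_duplicates` and `grid_bones` are hypothetical. It counts the duplicates and confirms that no grid-bone is shared by two duplicates.

```python
from itertools import combinations

def roll_duplicates(n, N):
    """Return the duplicates (i, c) of an n-node graph inside its N-fold roll.

    Grid nodes are (row, col) pairs with row in 0..N-1 and col in 0..n-1.
    """
    if (N - 1) % (n - 1) != 0:
        raise ValueError("N - 1 must be a multiple of n - 1")
    max_slope = (N - 1) // (n - 1)
    duplicates = {}
    for i in range(N):
        for c in range(max_slope + 1):
            # Column j of duplicate (i, c) sits in row (i + j*c) mod N.
            duplicates[(i, c)] = [((i + j * c) % N, j) for j in range(n)]
    return duplicates

def grid_bones(nodes):
    """All unordered node pairs inside one duplicate."""
    return {frozenset(pair) for pair in combinations(nodes, 2)}

n, N = 4, 10
dups = roll_duplicates(n, N)
all_bones = [grid_bones(nodes) for nodes in dups.values()]
num_duplicates = len(dups)                      # N * ((N-1)/(n-1) + 1)
total_bones = sum(len(b) for b in all_bones)    # counted with multiplicity
distinct_bones = len(set().union(*all_bones))   # counted once each
```

With n = 4 and N = 10 the slope ranges over {0, 1, 2, 3}, so there are 40 duplicates, more than N^2/n = 25, and the two grid-bone counts coincide because the pair sets are disjoint.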

Definition 3 (S-to-$\{-\alpha, 0, \beta\}$ randomized rounding)

Input: An instance of S-MaxAgree (S-MinDisagree) on general graph G = (V, E), where
w.l.o.g. it is assumed that $|\gamma| \leq 1$, $\forall \gamma \in S$; and $\alpha, \beta \geq 1$.

Output: The same graph with the following randomized rounding. For each edge of weight
$\gamma > 0$ (resp. $\gamma < 0$), round $\gamma$ to either 0 or $\beta$ (resp. either $-\alpha$ or 0)
independently and identically at random with expectation being $\gamma$.
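One way to realize a rounding with the stated expectation is to pick the nonzero value with probability |γ|/β (resp. |γ|/α). The sketch below assumes exactly that choice of probabilities; the function name `round_weight` and the `rng` parameter are hypothetical.

```python
import random

def round_weight(gamma, alpha, beta, rng=random):
    """Round gamma to a value in {-alpha, 0, beta} with expectation gamma.

    Assumes |gamma| <= 1 <= alpha, beta, so the probabilities below are valid.
    """
    if gamma > 0:
        # beta with probability gamma/beta, else 0: mean is gamma.
        return beta if rng.random() < gamma / beta else 0.0
    if gamma < 0:
        # -alpha with probability |gamma|/alpha, else 0: mean is gamma.
        return -alpha if rng.random() < -gamma / alpha else 0.0
    return 0.0

# Empirical check of the expectation on a positive and a negative weight.
rng = random.Random(0)
pos_samples = [round_weight(0.3, 2.0, 5.0, rng) for _ in range(200000)]
neg_samples = [round_weight(-0.5, 2.0, 5.0, rng) for _ in range(200000)]
pos_mean = sum(pos_samples) / len(pos_samples)
neg_mean = sum(neg_samples) / len(neg_samples)
```

Each edge is rounded by an independent call, which matches the "independently and identically" requirement of the definition.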

Denote by $w(\cdot)$ the weight function before rounding, and $w'(\cdot)$ the one after rounding. We
slightly abuse notation here by allowing both weight functions to take edges and clusterings as
parameters. For a clustering C, denote by $w'_\gamma(C)$ the total post-rounding weight of C
contributed by former-$\gamma$-edges.

Definition 4 (Contributing) Given an S-MaxAgree (resp. S-MinDisagree) instance and a
clustering C, we call an edge (i,j) of weight $\gamma$ a contributing edge iff $\gamma > 0$ (resp.
$\gamma < 0$) and (i,j) is inside a cluster of C, or $\gamma < 0$ (resp. $\gamma > 0$) and (i,j)
crosses different clusters of C.
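The definition of a contributing edge translates directly into a small predicate. The sketch below assumes the weight of a clustering is the sum of |γ| over its contributing edges; the dictionary-based representation and the names `contributing_edges` and `clustering_weight` are hypothetical.

```python
def contributing_edges(weights, cluster_of, objective="max_agree"):
    """Edges that count toward the objective under the given clustering.

    weights: dict mapping an edge (u, v) to its weight gamma (0 means non-edge).
    cluster_of: dict mapping each node to a cluster label.
    objective: "max_agree" or "min_disagree".
    """
    result = []
    for (u, v), gamma in weights.items():
        if gamma == 0:
            continue  # non-edges never contribute
        inside = cluster_of[u] == cluster_of[v]
        agrees = (gamma > 0 and inside) or (gamma < 0 and not inside)
        # MaxAgree counts agreements, MinDisagree counts disagreements.
        if agrees == (objective == "max_agree"):
            result.append((u, v))
    return result

def clustering_weight(weights, cluster_of, objective="max_agree"):
    """Total absolute weight of the contributing edges."""
    edges = contributing_edges(weights, cluster_of, objective)
    return sum(abs(weights[e]) for e in edges)

# Hypothetical 3-node instance: a and b together, c alone.
weights = {("a", "b"): 1.0, ("b", "c"): -0.5, ("a", "c"): 0.25}
cluster_of = {"a": 0, "b": 0, "c": 1}
agree = clustering_weight(weights, cluster_of, "max_agree")        # 1.0 + 0.5
disagree = clustering_weight(weights, cluster_of, "min_disagree")  # 0.25
```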

3 Main Theorems 

Given an S-MaxAgree (resp. S-MinDisagree) instance, first construct an N-fold roll
$G^N = (V^N, E^N)$, and then apply the S-to-$\{-\alpha, 0, \beta\}$ randomized rounding on $G^N$. If
we solve the $\{-\alpha, 0, \beta\}$ instance on $G^N$, the solution clustering C implies a total of
$\frac{N^2}{n}$ (not necessarily distinct) ways to cluster G, one for each of the $\frac{N^2}{n}$
duplicates of G. To see this, note C is simply a partition of $V^N$, and this partition induces a
partition, thus a clustering, on each of the $\frac{N^2}{n}$ duplicates of G. We call each of these
clusterings a candidate solution to the initial S-MaxAgree (resp. S-MinDisagree) instance on G and
denote them as $C_1, C_2, \ldots, C_{N^2/n}$.

Note although these $\frac{N^2}{n}$ duplicates of G share nodes of $G^N$, their edge sets are
disjoint. In fact, these $\frac{N^2}{n}$ duplicates of E form a partition of $E^N$. Lemma 1 is
immediate.

Lemma 1 For both S-MaxAgree and S-MinDisagree, $w(C) = \sum_{i=1}^{N^2/n} w(C_i)$.
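The additivity claimed by Lemma 1 can be observed numerically on a toy instance. This is a minimal sketch under stated assumptions: it uses the MaxAgree objective, 0-based grid indices, an arbitrary random partition of the grid nodes, and it keeps all N((N-1)/(n-1) + 1) duplicates rather than trimming to N^2/n; the names `agree_weight` and `build_roll` are hypothetical.

```python
import random

def agree_weight(edge_weights, cluster_of):
    """MaxAgree weight: sum of |gamma| over edges the clustering agrees with."""
    total = 0.0
    for (u, v), gamma in edge_weights.items():
        inside = cluster_of[u] == cluster_of[v]
        if (gamma > 0 and inside) or (gamma < 0 and not inside):
            total += abs(gamma)
    return total

def build_roll(weights, n, N):
    """Map each duplicate (i, c) to its weighted edges on grid nodes (row, col)."""
    max_slope = (N - 1) // (n - 1)
    roll = {}
    for i in range(N):
        for c in range(max_slope + 1):
            roll[(i, c)] = {
                (((i + a * c) % N, a), ((i + b * c) % N, b)): gamma
                for (a, b), gamma in weights.items()
            }
    return roll

# Hypothetical 3-node instance; nodes of G are the columns 0, 1, 2.
weights = {(0, 1): 1.0, (1, 2): -0.5, (0, 2): 0.25}
n, N = 3, 5
roll = build_roll(weights, n, N)

# An arbitrary clustering of the N*n grid nodes into at most 3 clusters.
rng = random.Random(1)
grid_clustering = {(r, col): rng.randrange(3) for r in range(N) for col in range(n)}

# Weight of the clustering on the whole roll.
roll_edges = {e: g for dup in roll.values() for e, g in dup.items()}
whole = agree_weight(roll_edges, grid_clustering)
# Sum of the weights of the candidate clusterings induced on each duplicate.
parts = sum(agree_weight(dup, grid_clustering) for dup in roll.values())
```

Because the edge sets of the duplicates are disjoint, merging them into `roll_edges` loses nothing, and `whole` equals `parts` for any choice of grid clustering.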

Our next lemma says that if an edge is not contributing before rounding, it must not be
contributing after rounding. Therefore, to calculate the weight of C both before and after the
rounding, we only need to concern ourselves with the same set of edges.

Lemma 2 For both S-MaxAgree and S-MinDisagree, let $\mathcal{E}(C)$ be the set of contributing
edges of C before randomized rounding is applied to $G^N$, i.e. $w(C) = \sum_{e \in \mathcal{E}(C)} w(e)$.
Then after rounding, the new weight of C is still a summation over the same set of edges, i.e.
$w'(C) = \sum_{e \in \mathcal{E}(C)} w'(e)$.

Proof. This follows from the observation that positive edges are rounded to have either positive 
or zero weights, and negative edges are rounded to have either negative or zero weights. □ 

We are now ready to give our main technical result in Lemma 3. We concern ourselves only with
S-MaxAgree here; a similar result holds for S-MinDisagree and is given in Lemma 4.

Lemma 3 Given an S-MaxAgree instance G = (V, E), let $G^N = (V^N, E^N)$ be the N-fold roll
of G with S-to-$\{-\alpha, 0, \beta\}$ randomized rounding applied. If

1. $\alpha + \beta = O((Nn)^{1/2-\delta})$, where $\delta \in (0, \frac{1}{2}]$; and

2. there is a $\lambda$-approximation algorithm for $\{-\alpha, 0, \beta\}$-MaxAgree, where $\lambda = O(\log n)$

then for any arbitrarily small number $\epsilon > 0$ there exists a polynomial time algorithm that
approximates S-MaxAgree within a factor of $(\lambda + \epsilon)$ with probability at least $\frac{1}{2}$.

Proof. For any $\gamma \in S$, let $X^{(\gamma)}$ denote the random variable representing the new
weight of a former-$\gamma$-edge after rounding. Define the random variable
$Y^{(\gamma)} = X^{(\gamma)} - \gamma$; clearly $\mathrm{E}[Y^{(\gamma)}] = 0$. Note it is assumed
w.l.o.g. that $|\gamma| \leq 1$, $\forall \gamma \in S$.

Suppose there is a polynomial time algorithm A that approximates $\{-\alpha, 0, \beta\}$-MaxAgree
within a factor of $\lambda$. We can then run A on $G^N$; the output clustering corresponds to
$\frac{N^2}{n}$ ways to cluster G (not necessarily all distinct). Let $C^*$ be the most weighted
among these $\frac{N^2}{n}$ clusterings of G. In the rest of the proof we show that with high
probability, $C^*$ is a $(\lambda + \epsilon)$-approximation of S-MaxAgree on G for any fixed $\epsilon$.

Denote by $\mathbf{E}$ the bad event that $C^*$ is not a $(\lambda + \epsilon)$-approximation on G.
Let C' be an arbitrary clustering of $G^N$ that does not imply a $(\lambda + \epsilon)$-approximation
on G. Denote by $\mathbf{E}(C')$ the event that C' becomes a $\lambda$-approximation on $G^N$ after
rounding. Since there are at most $(Nn)^{Nn}$ distinct clusterings of $G^N$, by the union bound
we have $\Pr\{\mathbf{E}\} \leq e^{Nn \ln Nn} \cdot \Pr\{\mathbf{E}(C')\}$. (We note that the
randomness of event $\mathbf{E}(C')$ comes from the randomized rounding and the randomness of event
$\mathbf{E}$ comes from both the randomized rounding and the internal randomness of A.)

Let the weight of an optimal clustering U of G be K, and denote by $U^N$ the corresponding
duplication clustering in $G^N$. That is, $U^N$ has the same number of clusters as U, and there is a
one-to-one mapping between the two sets of clusters such that a node $v_j$ is in a cluster of U if
and only if all its N duplicates, $v_{1j}, v_{2j}, \ldots, v_{Nj}$, are in the corresponding cluster
of $U^N$. We now proceed to prove that event $\mathbf{E}(C')$ happens with negligible probability.
Before delving into the details, we first offer a high level discussion of the idea behind the proof.

Intuition Behind the Proof. Since U is an optimal clustering of G, by Lemma 1 it is easy to
see that $U^N$ is an optimal clustering of $G^N$ before randomized rounding and its weight is
$\frac{KN^2}{n}$. C' is an arbitrary but fixed clustering. Since it does not imply a
$(\lambda + \epsilon)$-approximation on G, it must be the case that before rounding the weight of C'
on $G^N$ is less than $\frac{KN^2}{(\lambda+\epsilon)n}$. Since $\epsilon$ is a fixed constant, this
leaves a gap between $\frac{KN^2}{(\lambda+\epsilon)n}$ and $\frac{KN^2}{\lambda n}$. By Lemma 2 the
expectation of the new weight of $U^N$ is $\frac{KN^2}{n}$ and that of C' is at most
$\frac{KN^2}{(\lambda+\epsilon)n}$. Therefore for the bad event $\mathbf{E}(C')$ to happen, either C'
has to be really lucky in the rounding so that its new weight ends up hitting as high as
$\frac{KN^2}{\lambda n}$, or $U^N$ has to be really unlucky in the rounding so that its new weight
ends up touching as low as $\frac{\lambda KN^2}{(\lambda+\epsilon)n}$, or most likely some
combination of the two. Whichever case happens, the common thing shared is that one has to rely on
pure chance to close the gap. We show that by setting N = poly(n) sufficiently large, this happens
with negligible probability. In fact, the probability of $\mathbf{E}(C')$ is so small that even
$(Nn)^{Nn}$ times it is still negligible.

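We now resume the proof. For any $\gamma \in S$ and a clustering C of $G^N$, denote by
$\mathcal{E}(C, \gamma)$ the set of former-$\gamma$-edges that are contributing in C before rounding.
If $\mathbf{E}(C')$ happens, then

$$
\sum_{\gamma \in S} w'_\gamma(C') \;\geq\; \frac{1}{\lambda} \sum_{\gamma \in S} w'_\gamma(U^N)
\qquad \text{and} \qquad
\sum_{\gamma \in S} |\mathcal{E}(C', \gamma)| \cdot |\gamma| \;<\; \frac{1}{\lambda + \epsilon} \sum_{\gamma \in S} |\mathcal{E}(U^N, \gamma)| \cdot |\gamma|
$$

where the first inequality follows because C' is a $\lambda$-approximation on $G^N$ after rounding,
and $\sum_{\gamma \in S} w'_\gamma(\cdot)$ is the total weight of a clustering after rounding; the
second inequality follows from Lemma 1 and the fact that each of the $\frac{N^2}{n}$ candidate
solutions implied by C' has weight less than $\frac{K}{\lambda+\epsilon}$. Simple manipulation of
the two inequalities above yields

$$
\Omega_1 - \frac{\Omega_2}{\lambda}
\;\geq\; \frac{1}{\lambda} \sum_{\gamma \in S} |\mathcal{E}(U^N, \gamma)| \cdot |\gamma| - \sum_{\gamma \in S} |\mathcal{E}(C', \gamma)| \cdot |\gamma|
\;>\; \left( \frac{1}{\lambda} - \frac{1}{\lambda + \epsilon} \right) \frac{KN^2}{n}
\;=\; \frac{\epsilon K N^2}{\lambda(\lambda + \epsilon) n}
$$

where $\Omega_1 = \sum_{\gamma \in S} \left( w'_\gamma(C') - |\mathcal{E}(C', \gamma)| \cdot |\gamma| \right)$
and $\Omega_2 = \sum_{\gamma \in S} \left( w'_\gamma(U^N) - |\mathcal{E}(U^N, \gamma)| \cdot |\gamma| \right)$.

Since $\lambda = O(\log(Nn)) = O(\log n)$ and $K \geq 1$, when n is sufficiently large we have
$\lambda(\lambda+\epsilon) < n$ and hence $\Omega_1 - \frac{\Omega_2}{\lambda} > \frac{\epsilon N^2}{n^2}$.
This implies

$$
\Pr\{\mathbf{E}(C')\} \;\leq\; \Pr\left\{ \Omega_1 - \frac{\Omega_2}{\lambda} > \frac{\epsilon N^2}{n^2} \right\}
$$

Note the expectations of both $\Omega_1$ and $\Omega_2$ are 0, therefore so is that of the linear
combination $\Omega_1 - \Omega_2/\lambda$; in the following we argue that the probability for
$\Omega_1 - \Omega_2/\lambda$ to deviate from its mean by $\epsilon N^2/n^2$ is negligible when N is
sufficiently large.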

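For any $\gamma \in S$ let $z_1(\gamma) = |\mathcal{E}(C', \gamma) \setminus \mathcal{E}(U^N, \gamma)|$
be the number of former-$\gamma$-edges contributing in C' but not $U^N$ before rounding. Similarly,
define $z_2(\gamma) = |\mathcal{E}(U^N, \gamma) \setminus \mathcal{E}(C', \gamma)|$ and
$z_3(\gamma) = |\mathcal{E}(U^N, \gamma) \cap \mathcal{E}(C', \gamma)|$. We have

$$
\begin{aligned}
\Pr\left\{ \Omega_1 - \frac{\Omega_2}{\lambda} > \frac{\epsilon N^2}{n^2} \right\}
&= \Pr\left\{ \sum_{\gamma \in S} \left( \sum_{i=1}^{z_1(\gamma)} Y_i^{(\gamma)} + \frac{1}{\lambda} \sum_{j=1}^{z_2(\gamma)} \left(-Y_j^{(\gamma)}\right) + \frac{\lambda - 1}{\lambda} \sum_{k=1}^{z_3(\gamma)} Y_k^{(\gamma)} \right) > \frac{\epsilon N^2}{n^2} \right\} \\
&\leq \Pr\left\{ \sum_{\gamma \in S} \left( \sum_{i=1}^{z_1(\gamma)} Y_i^{(\gamma)} + \sum_{j=1}^{z_2(\gamma)} \left(-Y_j^{(\gamma)}\right) + \sum_{k=1}^{z_3(\gamma)} Y_k^{(\gamma)} \right) > \frac{\epsilon N^2}{n^2} \right\} && (\lambda \geq 1) \\
&\leq \sum_{\gamma \in S} \sum_{h=1}^{3} \Pr\left\{ \sum_{i=1}^{z_h(\gamma)} (-1)^{h+1} Y_i^{(\gamma)} > \frac{\epsilon N^2}{3 |S| n^2} \right\} && \text{(union bound)} \\
&\leq \sum_{\gamma \in S} \sum_{h=1}^{3} \exp\left( -2 z_h(\gamma) \left( \frac{\epsilon N^2}{3 |S| n^2 z_h(\gamma) (\alpha + \beta)} \right)^2 \right) && \text{(Hoeffding bound)} \\
&\leq 3 |S| \cdot \exp\left( -c_1 \cdot \frac{N^2}{n^9 (\alpha + \beta)^2} \right) && (|S| \leq n^2,\ z_h(\gamma) \leq N^2 n)
\end{aligned}
$$

where $c_1$ is some constant. Since we allow $\alpha + \beta = O((Nn)^{1/2-\delta})$ and want
$(Nn)^{Nn} \cdot \Pr(\mathbf{E}(C'))$ to be negligible, it is now clear why we need $N^2/n$
duplicates of E and thus the N-fold roll construction given in Definition 2. In contrast, had we
adopted a naive construction with N isolated duplicates of G, there would be only N duplicates of
E; and it is readily verified that this is insufficient to prove that
$(Nn)^{Nn} \cdot \Pr(\mathbf{E}(C'))$ is negligible.

Now set $N = n^{6/\delta}$; we have

$$
\Pr\{\mathbf{E}\} \;\leq\; (Nn)^{Nn} \cdot \Pr\{\mathbf{E}(C')\} \;\leq\; 3n^2 \cdot \exp\left( (6/\delta + 1)\, n^{6/\delta + 1} \ln n - c_2 \cdot n^{6/\delta + 2 + 2\delta} \right)
$$

for some constant $c_2$. Note this probability is bounded by $\frac{1}{2}$ when the input size n is
sufficiently large. Therefore we have obtained a polynomial time algorithm that approximates
S-MaxAgree within a factor of $\lambda + \epsilon$ with probability at least $\frac{1}{2}$. □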
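The exponent bookkeeping behind the choice $N = n^{6/\delta}$ can be double-checked with exact rational arithmetic. The sketch below rests on two assumptions that should be compared against the derivation: that the Hoeffding step leaves a tail term of the form $N^2 / (n^9 (\alpha+\beta)^2)$, and that $(\alpha+\beta)^2$ is taken at its largest allowed order $(Nn)^{1-2\delta}$. The function name `exponents` is hypothetical.

```python
from fractions import Fraction

def exponents(delta):
    """Exponents of n in the two competing terms when N = n**(6/delta).

    Returns (union_exp, tail_exp), where
      union_exp: exponent of n in Nn, the polynomial part of ln((Nn)**(Nn));
      tail_exp:  exponent of n in N**2 / (n**9 * (Nn)**(1 - 2*delta)).
    """
    delta = Fraction(delta)
    log_N = 6 / delta                      # N = n**log_N
    union_exp = log_N + 1
    tail_exp = 2 * log_N - 9 - (1 - 2 * delta) * (log_N + 1)
    return union_exp, tail_exp

# Example with delta = 1/4, i.e. N = n**24.
union_exp, tail_exp = exponents(Fraction(1, 4))
```

Under these assumptions the tail exponent simplifies to 6/δ + 2 + 2δ and exceeds the union-bound exponent 6/δ + 1 by 1 + 2δ, so the negative term dominates even after the extra ln n factor.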

We give a similar result for S-MinDisagree in Lemma 4; the proof follows essentially the same
construction and analysis as Lemma 3, so we only give a high level discussion without
duplicating the proof.

Lemma 4 Given an S-MinDisagree instance G = (V, E), let $G^N = (V^N, E^N)$ be the N-fold
roll of G with S-to-$\{-\alpha, 0, \beta\}$ randomized rounding applied. If

1. $\alpha + \beta = O((Nn)^{1/2-\delta})$, where $\delta \in (0, \frac{1}{2}]$; and

2. there is a $\lambda$-approximation algorithm for $\{-\alpha, 0, \beta\}$-MinDisagree, where $\lambda = O(\log n)$

then for any arbitrarily small number $\epsilon > 0$ there exists a polynomial time algorithm that
approximates S-MinDisagree within a factor of $(\lambda + \epsilon)$ with probability at least $\frac{1}{2}$.

Proof. (Sketch) We define U and C' analogously to those in the proof of Lemma 3. The weight of
$U^N$ before rounding is $\frac{KN^2}{n}$, and the weight of C' before rounding is greater than
$\frac{(\lambda+\epsilon)KN^2}{n}$. Again since $\epsilon$ is a fixed constant, there is a gap between
$\frac{(\lambda+\epsilon)KN^2}{n}$ and $\frac{\lambda KN^2}{n}$. For C' to be a
$\lambda$-approximation after rounding, its new weight must necessarily be at most $\lambda$ times
the new weight of $U^N$. Since the expectation of the new weight of $U^N$ is $\frac{KN^2}{n}$ and
that of C' is greater than $\frac{(\lambda+\epsilon)KN^2}{n}$, again we need to rely on chance to
close this gap of $\frac{\epsilon KN^2}{n}$. By applying a similar analysis as in Lemma 3, we can
show that even $(Nn)^{Nn}$ times this probability, which upper bounds the probability of the bad
event that a $\lambda$-approximation on $G^N$ does not imply a $(\lambda + \epsilon)$-approximation
on G, is negligible. □

Lemmas 3 and 4 lead to the following theorem.

Theorem 2 If S-MaxAgree (resp. S-MinDisagree) is NP-hard to approximate within a factor of
$\lambda$ ($\lambda = O(\log n)$) for any specific choice of S, then for any choice of S', where
$\max_{\gamma \in S'} |\gamma| = O(n^{1/2-\delta})$ for some $\delta \in (0, \frac{1}{2}]$, no
polynomial time algorithm, possibly randomized, can approximate S'-MaxAgree (resp. S'-MinDisagree)
within a factor of $\lambda + \epsilon$ with probability at least $\frac{1}{2}$ unless $NP = RP$.

Proof. This follows from Lemmas 3 and 4 by setting $\alpha = -\min S$ and $\beta = \max S$. □

In particular, invoking the result by Charikar et al. in Theorem 1 leads to the following improved
inapproximability result.

Theorem 3 No polynomial time algorithm, possibly randomized, can approximate the unweighted
version of MaxAgree in general graphs within a factor of $80/79 - \epsilon$ unless $NP = RP$.